Abstract

We consider swapping of two records in a microdata set for the purpose of
disclosure control. We give some necessary and sufficient conditions under which some
observations can be swapped between two records under the restriction that a given
set of marginals is fixed. We also give an algorithm to find another record for
swapping if one wants to swap out some observations from a particular record. Our
result has a close connection to the construction of Markov bases for contingency
tables with given marginals.

Keywords and phrases: decomposable model, disclosure control, graphical model, hierar- 
chical model, Markov basis, primitive move. 



1 Introduction 



In statistical disclosure control of microdata sets, swapping of observations among records
is considered to be a convenient disclosure control technique, especially because it
preserves one-dimensional marginals. Data swapping was introduced by Dalenius and Reiss
[1982] and Schlörer [1981]. Takemura [2002] considered optimal pairing of close records of a
microdata set to perform swapping. As explained in Dobra [2003] and Dobra and Sullivant
[2004], swapping has a close connection to the theory of Markov bases for contingency
tables. See Willenborg and de Waal [2001] for a review of disclosure control techniques
for microdata sets.
Suppose that a statistical agency is considering granting access to a microdata set
to some researchers and the data set contains some rare and risky records. We consider
the case that all variables of the data set have already been categorized. Swapping of
observations is one of the useful techniques for protecting these records. If some marginals
from the data set have already been published, it is desirable to perform the swapping
in such a way that it does not disturb the published marginal frequencies.
Therefore it is important to determine whether it is possible to perform swapping of
risky records under the restriction that some marginals are fixed. See Takemura and Endo
[2006] for a realistic example of the need for swapping.

Feasibility of swapping under the restriction that some marginals are fixed depends on
the set of fixed marginals. We here illustrate this point by a simple hypothetical example.
Suppose that a microdata set contains the following two records.

sex      age   occupation       residence

male     55    nurse            Tokyo
female   50    police officer   Osaka



If we swap "occupation" among these two records we obtain 

sex      age   occupation       residence

male     55    police officer   Tokyo
female   50    nurse            Osaka



By this swapping the one-dimensional marginals are preserved, but the two-dimensional 
marginal of {age, occupation} is disturbed. If we swap both age and occupation we obtain 

sex      age   occupation       residence

male     50    police officer   Tokyo
female   55    nurse            Osaka

and the {age, occupation}-marginal is also preserved.

This simple example shows that observations can be freely swapped if we fix only the 
one-dimensional marginals, but some observations have to be swapped together to keep 
two-dimensional marginals fixed. 
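The example above can be replayed in a few lines of Python. This is only an illustrative sketch: the helper names `marginal` and `swap` and the tuple encoding of records are assumptions made here, not part of the paper.

```python
from collections import Counter

# Variable names and the two records from the example above.
VARIABLES = ("sex", "age", "occupation", "residence")
records = [
    ("male", 55, "nurse", "Tokyo"),
    ("female", 50, "police officer", "Osaka"),
]

def marginal(recs, names):
    """Frequency table of the marginal on the variables listed in `names`."""
    idx = [VARIABLES.index(v) for v in names]
    return Counter(tuple(r[k] for k in idx) for r in recs)

def swap(recs, names):
    """Exchange the values of the variables in `names` between the two records."""
    idx = {VARIABLES.index(v) for v in names}
    a, b = recs
    new_a = tuple(b[k] if k in idx else a[k] for k in range(len(a)))
    new_b = tuple(a[k] if k in idx else b[k] for k in range(len(b)))
    return [new_a, new_b]

only_occupation = swap(records, ["occupation"])
age_and_occupation = swap(records, ["age", "occupation"])

# Swapping occupation alone disturbs the {age, occupation}-marginal ...
print(marginal(only_occupation, ["age", "occupation"]) == marginal(records, ["age", "occupation"]))
# ... while swapping age and occupation together preserves it.
print(marginal(age_and_occupation, ["age", "occupation"]) == marginal(records, ["age", "occupation"]))
```

Comparing `Counter` objects is a direct way to compare marginal frequency tables, since cells with zero count simply do not appear.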

In fact if all two-dimensional marginals are fixed, then it is impossible to swap
observations between any two records without disturbing at least one of the two-dimensional
marginals. This is because if some observations are swapped and some observations are
not swapped between two records, then the two-dimensional marginal of a swapped
variable and a non-swapped variable is disturbed. This fact is clarified in a general form in
Theorem 3.1 in Section 3.1.

Actually there is a possibility of swapping observations involving more than two records
to keep all two-dimensional marginals fixed. We present an example of this possibility
in Section 4. Swapping among more than two records is closely related to higher degree
moves of Markov bases for contingency tables. It is well known that a Markov basis involving
higher degree moves is very complicated (e.g. Aoki and Takemura [2003]).
In this paper we consider swapping between two records only and we give some
necessary and sufficient conditions for swappability of two records such that a given set of
marginals is fixed. We also give a practical algorithm to find another record for
swapping if one wants to swap out some observations from a particular record. Our conditions
are conveniently described in terms of decompositions by minimal vertex separators of a
graphical model generated by the set of marginals. Results of the present paper are
successfully applied in Takemura and Endo [2006] to check swappability of risky records in
a microdata set of a substantial size.

The organization of this paper is as follows. In Section 2 we summarize notation
and present some preliminary results including the equivalence of swapping between two
records and a primitive move of a Markov basis. In Section 3 we give some necessary and
sufficient conditions for swappability of two records. We also give an algorithm to find
another record for swapping for a particular record. Some discussions are given in Section
4. Technical details are postponed to the Appendix.



2 Preliminaries 

In this section we first set up appropriate notation and summarize some preliminary
results for this paper. Consider an $n \times k$ microdata set $X$ consisting of observations on
$k$ variables for $n$ individuals (records). As mentioned above we assume that the variables
have already been categorized. Therefore we can identify the microdata set with a
$k$-way contingency table, if we ignore the labels of the individuals. Concerning contingency
tables, we mostly follow the notation in Dobra [2003] and Dobra and Sullivant [2004].
$\mathbf{n}$ denotes a $k$-way contingency table. For a positive integer $m$, $\{1, \dots, m\}$ is
denoted by $[m]$. Let $\Delta = [k] = \{1, \dots, k\}$ denote the set of variables. The cells of the
contingency table are denoted by $i = (i_1, \dots, i_k) \in \mathcal{I} = [I_1] \times \cdots \times [I_k]$.
Each record of the microdata set falls into some cell $i$. $n(i)$ denotes the frequency of
cell $i$. If $n(i) = 1$, we say that the record falling into cell $i$ is a sample unique record.

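For a subset $D \subset \Delta$ of variables, the $D$-marginal $\mathbf{n}_D$ of $\mathbf{n}$ is the contingency table
with marginal cells $i_D \in \mathcal{I}_D = \prod_{j \in D} [I_j]$ and entries given by

$$n_D(i_D) = \sum_{i_{D^c} \in \mathcal{I}_{D^c}} n(i_D, i_{D^c}).$$

Here we are denoting $i = (i_D, i_{D^c})$ by ignoring the order of the indices.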

Let $E$ be a non-empty proper subset of $\Delta$. For two records of $X$ falling into cells
$i = (i_E, i_{E^c})$ and $j = (j_E, j_{E^c})$, $i \neq j$, swapping of $i$ and $j$ with respect to $E \subset \Delta$, or
more simply $E$-swapping, means that these records are changed as

$$\{(i_E, i_{E^c}), (j_E, j_{E^c})\} \to \{(i_E, j_{E^c}), (j_E, i_{E^c})\}. \qquad (1)$$

Note that $E$-swapping is equivalent to $E^c$-swapping. Also note that if $i_E = j_E$ or
$i_{E^c} = j_{E^c}$, then swapping in (1) results in the same set of records. Therefore (1) results in a
different set of records if and only if

$$i_E \neq j_E \quad \text{and} \quad i_{E^c} \neq j_{E^c}. \qquad (2)$$

From now on we say that $E$-swapping is effective if it results in a different set of records.

We now ask when $E$-swapping fixes $D$-marginals. $D$-marginals are fixed by $E$-swapping
if and only if one of the following four conditions holds.

$$\text{i) } D \subset E, \quad \text{ii) } D \subset E^c, \quad \text{iii) } i_{E \cap D} = j_{E \cap D}, \quad \text{iv) } i_{E^c \cap D} = j_{E^c \cap D}. \qquad (3)$$

It is obvious that if one of the conditions holds, then $D$-marginals are not altered. On the
other hand assume that none of the four conditions holds. Let $D_1 = D \cap E$ and $D_2 = D \cap E^c$.
These are non-empty because i) and ii) do not hold. Furthermore $i_{D_1} \neq j_{D_1}$ and $i_{D_2} \neq j_{D_2}$
because iii) and iv) do not hold. Let $i_D = (i_{D_1}, i_{D_2})$. Then $n_D(i_D) = n_D(i_{D_1}, i_{D_2})$ is
decreased by 1 by this swapping and this particular $D$-marginal changes.
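Condition (3) is easy to test mechanically. The following sketch compares it with a brute-force recount of the $D$-marginal cells before and after the swap. Both function names, the 0-based variable positions, and the tuple encoding of cells are assumptions made for illustration.

```python
def fixes_marginal_by_condition_3(i, j, E, D):
    """Condition (3): True if E-swapping of cells i and j leaves the D-marginal unchanged."""
    E, D = set(E), set(D)
    inside = D & E    # D intersected with E
    outside = D - E   # D intersected with the complement of E
    return (
        not outside                              # i)   D is contained in E
        or not inside                            # ii)  D is contained in E^c
        or all(i[s] == j[s] for s in inside)     # iii) i and j agree on E ∩ D
        or all(i[s] == j[s] for s in outside)    # iv)  i and j agree on E^c ∩ D
    )

def fixes_marginal_by_counting(i, j, E, D):
    """Brute force: compare the D-marginal cells occupied before and after the swap."""
    E, D = set(E), sorted(D)
    k = len(i)
    i2 = tuple(i[s] if s in E else j[s] for s in range(k))  # (i_E, j_{E^c})
    j2 = tuple(j[s] if s in E else i[s] for s in range(k))  # (j_E, i_{E^c})

    def project(cell):
        return tuple(cell[s] for s in D)

    return sorted([project(i), project(j)]) == sorted([project(i2), project(j2)])

# Swapping variable 0 only between (0,0,0) and (1,1,1) disturbs the {0,1}-marginal
# but leaves the {1,2}-marginal untouched.
print(fixes_marginal_by_condition_3((0, 0, 0), (1, 1, 1), {0}, {0, 1}))
print(fixes_marginal_by_condition_3((0, 0, 0), (1, 1, 1), {0}, {1, 2}))
```

Only the two swapped records change cells, so comparing their projections onto $D$ before and after the swap is enough to decide whether the whole $D$-marginal table is unchanged.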

So far we have only considered one marginal $D$. We need to consider a set of marginals
$\mathcal{D} = \{D_1, \dots, D_r\}$. For simplicity throughout this paper we assume $\Delta = \bigcup_{s=1}^{r} D_s$. If
$\bigcup_{s=1}^{r} D_s$ is a proper subset of $\Delta$, we can simply replace $\Delta$ by $\bigcup_{s=1}^{r} D_s$, because there is no
restriction on frequency distributions involving variables in $(\bigcup_{s=1}^{r} D_s)^c$. We investigate
conditions for swapping two records such that all marginals in $\mathcal{D}$ are fixed. Note that
a smaller marginal can be computed by further summation of frequencies of a larger
marginal. This implies that in $\mathcal{D}$ we only need to consider $D_1, \dots, D_r$ such that there
is no inclusion relation between them, i.e. $\mathcal{D}$ is an "antichain" (Klain and Rota [1997]).
Any antichain $\mathcal{D}$ is a generating class of a hierarchical model for the contingency table
(Lauritzen [1996]).

A hierarchical model with a generating class $\mathcal{D}$ is graphical if $\mathcal{D}$ coincides with the set
of (maximal) cliques of a graph $G$ with vertex set $\Delta$. A graphical model is decomposable
if $G$ is a chordal graph.

Given a generating class $\mathcal{D}$, we define a graph $G^{\mathcal{D}}$ generated by $\mathcal{D}$ as follows. The
vertex set of $G^{\mathcal{D}}$ is $\Delta$. We put an edge between $s, t \in \Delta$ if and only if there exists $D \in \mathcal{D}$
such that $\{s, t\} \subset D$. Note that the graphical model associated with $G^{\mathcal{D}}$ is the smallest
graphical model containing the hierarchical model with the generating class $\mathcal{D}$.
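The construction of the generated graph is a one-liner over pairs. The sketch below is illustrative only; the function name `generated_graph` and the representation of a generating class as a list of Python sets are assumptions.

```python
from itertools import combinations

def generated_graph(generating_class):
    """Edge set of the graph generated by a generating class (a list of variable sets)."""
    edges = set()
    for D in generating_class:
        # Every pair of variables appearing together in some D is joined by an edge.
        for s, t in combinations(sorted(D), 2):
            edges.add((s, t))
    return edges

# All two-element subsets of {1, 2, 3}: the hierarchical model is not graphical,
# but the generated graph is the complete graph, i.e. the saturated model.
print(generated_graph([{1, 2}, {2, 3}, {1, 3}]) == generated_graph([{1, 2, 3}]))
```

The last line illustrates the remark that the graph retains only pairwise adjacency: two different generating classes can generate the same graph.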

An integer array $\mathbf{f} = \{f(i)\}_{i \in \mathcal{I}}$ is a move for $\mathcal{D}$ if $f_D(i_D) = 0$ for all $D \in \mathcal{D}$. $\mathbf{f}$ is
a primitive move for $\mathcal{D}$ if it is a move for $\mathcal{D}$ and furthermore if two entries of $\mathbf{f}$ are $1$,
two entries are $-1$ and the other entries are $0$. Adding a move $\mathbf{f}$ to $\mathbf{n}$, or applying $\mathbf{f}$
to $\mathbf{n}$, obviously does not alter the $D$-marginal for every $D \in \mathcal{D}$. It is intuitively clear
that a primitive move and swapping of observations of two records are equivalent. In
fact Dobra [2003] does not distinguish these two. However there is at least a conceptual
difference between them, because a move is defined for a given set of marginals $\mathcal{D}$ whereas
$E$-swapping is defined only in terms of two records and a subset $E$. We give a proof of
this equivalence in the Appendix.
3 Necessary and sufficient conditions of swappability



In this section we give some necessary and sufficient conditions for swappability of
observations between two records. In particular in Theorem 3.1 we state a necessary and
sufficient condition in terms of an induced subgraph of $G^{\mathcal{D}}$, which is convenient for
application. Then we describe a practical algorithm to find another record for swapping for a
particular record.

3.1 Swappability between two records 

In (3) we have already given a necessary and sufficient condition for $E$-swapping to fix
$D$-marginals. However (3) is not very useful for considering simultaneous fixing of marginals
in $\mathcal{D} = \{D_1, \dots, D_r\}$.

For a clear argument it is better to distinguish variables which are common to the two records
from variables which have different values in the two records. Note that if some variable has
the same value in the two records, swapping or not swapping that variable does not make
any difference. Therefore we should only look at variables taking different values in the two
records. Let

$$\bar{\Delta} = \{s \mid i_s \neq j_s\} \qquad (4)$$

denote the set of variables taking different values in the two records. Note that (2) holds if
and only if

$$E \cap \bar{\Delta} \neq \emptyset \quad \text{and} \quad E^c \cap \bar{\Delta} \neq \emptyset. \qquad (5)$$

Therefore $E$-swapping is effective if and only if (5) holds. In particular
$\bar{\Delta}$ has to contain at least two elements, because if $\bar{\Delta}$ has fewer than two elements, swapping
between $i$ and $j$ cannot result in a different set of records.

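We now show the following lemma, which says that the variables in
$\bar{\Delta} \cap D$ have to be swapped simultaneously or otherwise stay together in order not to
disturb $D$-marginals.

Lemma 3.1. An effective $E$-swapping fixes $D$-marginals if and only if $\bar{\Delta} \cap D \subset E$ or
$\bar{\Delta} \cap D \subset E^c$ under (5).

Proof. We have to check that one of the four conditions in (3) holds if and only if $\bar{\Delta} \cap D \subset E$
or $\bar{\Delta} \cap D \subset E^c$.

Assume that one of the four conditions in (3) holds. If $D \subset E$, then $\bar{\Delta} \cap D \subset E$.
Similarly if $D \subset E^c$, then $\bar{\Delta} \cap D \subset E^c$. Now suppose $i_{E \cap D} = j_{E \cap D}$. Then

$$\emptyset = \bar{\Delta} \cap (E \cap D) = (\bar{\Delta} \cap D) \cap E \;\Rightarrow\; \bar{\Delta} \cap D \subset E^c.$$

Similarly if $i_{E^c \cap D} = j_{E^c \cap D}$ then $\bar{\Delta} \cap D \subset E$.

Conversely assume that $\bar{\Delta} \cap D \subset E$ or $\bar{\Delta} \cap D \subset E^c$. In the former case $\bar{\Delta} \cap D \cap E^c = \emptyset$
and this implies iv) $i_{E^c \cap D} = j_{E^c \cap D}$. Similarly in the latter case iii) $i_{E \cap D} = j_{E \cap D}$ holds. □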
In the above lemma, $E$ is given. Now suppose that two records $i, j$ and a marginal $D$
are given and we are asked to find a non-empty proper subset $E \subset \Delta$ such that $E$-swapping
is effective and fixes $D$-marginals. As a simple consequence of Lemma 3.1 we have the
following lemma.

Lemma 3.2. Given two records $i, j$ and $D \subset \Delta$, we can find $E \subset \Delta$ such that $E$-swapping
is effective and fixes $D$-marginals if and only if $\bar{\Delta} \cap D^c \neq \emptyset$ and $|\bar{\Delta}| \geq 2$.

Proof. If $\bar{\Delta} \cap D^c \neq \emptyset$ and $|\bar{\Delta}| \geq 2$, then choose $s \in \bar{\Delta} \cap D^c$ and let $E = \{s\}$ be a
one-element set. Then $E$ satisfies the requirement.

If $|\bar{\Delta}| \leq 1$, there is no $E$-swapping resulting in a different set of records as mentioned
above. On the other hand if $\bar{\Delta} \cap D^c = \emptyset$, i.e. $\bar{\Delta} \subset D$, then by Lemma 3.1 $\bar{\Delta} \subset E$
(replacing $E$ by $E^c$ if necessary). But this
contradicts $E^c \cap \bar{\Delta} \neq \emptyset$ in (5) and there exists no $E$ satisfying the requirement. □

Based on the above preparations we now consider the following problem. Let two
records $i, j$ and a set of marginals $\mathcal{D} = \{D_1, \dots, D_r\}$ be given. We are asked to find $E$
such that $E$-swapping fixes all marginals of $\mathcal{D}$ and results in a different set of records. We
consider this problem in terms of a graphical model. In the previous section we introduced
a graph $G^{\mathcal{D}}$ generated by $\mathcal{D}$. Let $G^{\mathcal{D}}_{\bar{\Delta}}$ denote the induced subgraph of $G^{\mathcal{D}}$ where the vertex
set is restricted to $\bar{\Delta}$. Note that $G^{\mathcal{D}}_{\bar{\Delta}}$ is a graph with the vertex set $\bar{\Delta}$ and an edge between
$s, t \in \bar{\Delta}$ if and only if there exists $D \in \mathcal{D}$ such that $\{s, t\} \subset D$.

Recall that the variables $s$ and $t$ belonging to some $D \in \mathcal{D}$ either have to be swapped
out simultaneously or stay together. It follows that all variables in a connected component
of $G^{\mathcal{D}}_{\bar{\Delta}}$ have to be swapped out simultaneously or stay together simultaneously. Therefore
we have the following theorem, which is the main theorem of this paper.

Theorem 3.1. Given two records $i, j$ and a generating class $\mathcal{D}$, we can find $E \subset \Delta$ such
that $E$-swapping is effective and fixes all $D$-marginals, $\forall D \in \mathcal{D}$, if and only if $G^{\mathcal{D}}_{\bar{\Delta}}$ is not
connected.

Proof. As mentioned above, there exists no $E \subset \Delta$ such that $E$-swapping is effective and
fixes all $D$-marginals in the case where $G^{\mathcal{D}}_{\bar{\Delta}}$ is connected.

Conversely assume that $G^{\mathcal{D}}_{\bar{\Delta}}$ is not connected. Let $\gamma$ be a connected component of
$G^{\mathcal{D}}_{\bar{\Delta}}$. Then for any two vertices $s, t$ such that $s \in \gamma$ and $t \in \bar{\Delta} \setminus \gamma$ there exists no
$D \in \mathcal{D}$ satisfying $\{s, t\} \subset D$. Therefore if we set $E = \gamma$, $E$-swapping is effective and
fixes all $D$-marginals. □
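The constructive half of the proof translates directly into code: collect the differing variables, grow one connected component of the induced subgraph, and return it as $E$ when it is not the whole vertex set. The sketch below assumes 0-based variable positions and a generating class given as a list of sets; the name `find_swap_set` is hypothetical.

```python
def find_swap_set(i, j, generating_class):
    """Return a set E giving an effective E-swapping that fixes all marginals, or None."""
    differing = {s for s in range(len(i)) if i[s] != j[s]}  # the set defined in (4)
    if len(differing) < 2:
        return None
    # Grow the connected component of the induced subgraph that contains one vertex.
    start = min(differing)
    component, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for D in generating_class:
            if s in D:
                # Neighbours of s are the differing variables sharing some D with s.
                for t in (set(D) & differing) - component:
                    component.add(t)
                    frontier.append(t)
    # If the component is everything, the induced subgraph is connected: no valid E.
    return component if component != differing else None

i = ("male", 55, "nurse", "Tokyo")
j = ("female", 50, "police officer", "Osaka")
# Fixing the {age, occupation}-marginal and the one-dimensional marginals of sex and residence.
print(find_swap_set(i, j, [{0}, {1, 2}, {3}]))
```

For the records of the introductory example, the returned set contains only the position of "sex", so swapping sex alone is one admissible choice under these fixed marginals.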

For example let $\mathcal{D}$ consist of all two-element sets of $\Delta$. This $\mathcal{D}$ corresponds to the
hierarchical model containing all two-variable interaction terms but not containing any
higher order interaction terms. For this $\mathcal{D}$, $G^{\mathcal{D}}$ is the complete graph, corresponding to
the saturated model.

If $\mathcal{D}$ consists of all two-element sets of $\Delta$, i.e., if we have to fix all two-dimensional
marginals, then $G^{\mathcal{D}}$ is complete and $G^{\mathcal{D}}_{\bar{\Delta}}$ is also complete. In particular $G^{\mathcal{D}}_{\bar{\Delta}}$ is connected
and Theorem 3.1 says that we cannot find an effective swapping fixing all two-dimensional
marginals.
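Let $\mathcal{S}^{\mathcal{D}}$ be the set of the minimal vertex separators of $G^{\mathcal{D}}$. It is well known that any
$S \in \mathcal{S}^{\mathcal{D}}$ induces a complete subgraph of $G^{\mathcal{D}}$ when $G^{\mathcal{D}}$ is chordal, that is, $\mathcal{D}$ is a generating
class of a decomposable model. Denote the induced subgraph of $G^{\mathcal{D}}$ on $\Delta \setminus S$ by $G^{\mathcal{D}}_{\Delta \setminus S}$.
Let $\mathrm{adj}(\alpha)$, $\alpha \in \Delta$, denote the set of vertices which are adjacent to $\alpha$. Define $\mathrm{adj}(A)$ for
$A \subset \Delta$ by $\mathrm{adj}(A) = \bigcup_{\alpha \in A} \mathrm{adj}(\alpha) \setminus A$. Then we obtain the following lemma.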

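Lemma 3.3. $G^{\mathcal{D}}_{\bar{\Delta}}$ is not connected if and only if there exist $S \in \mathcal{S}^{\mathcal{D}}$ and two connected
components $\gamma_\alpha$ and $\gamma_\beta$ of $G^{\mathcal{D}}_{\Delta \setminus S}$ such that

$$S \cap \bar{\Delta} = \emptyset, \quad \gamma_\alpha \cap \bar{\Delta} \neq \emptyset, \quad \gamma_\beta \cap \bar{\Delta} \neq \emptyset. \qquad (6)$$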

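Proof. Assume that $G^{\mathcal{D}}_{\bar{\Delta}}$ is not connected. Let $\gamma_{\bar{\Delta},1}$ and $\gamma_{\bar{\Delta},2}$ be any two connected
components of $G^{\mathcal{D}}_{\bar{\Delta}}$. For any pair of vertices $(\alpha, \beta)$ such that $\alpha \in \gamma_{\bar{\Delta},1}$ and $\beta \in \gamma_{\bar{\Delta},2}$, $\mathrm{adj}(\gamma_{\bar{\Delta},1})$
is an $(\alpha, \beta)$-separator (not necessarily minimal) in $G^{\mathcal{D}}$. Hence there exists $S_{\alpha,\beta} \in \mathcal{S}^{\mathcal{D}}$ such
that $S_{\alpha,\beta} \subset \mathrm{adj}(\gamma_{\bar{\Delta},1})$. If there does not exist $S_{\alpha,\beta} \in \mathcal{S}^{\mathcal{D}}$ satisfying $S_{\alpha,\beta} \cap \bar{\Delta} = \emptyset$, then
$\mathrm{adj}(\gamma_{\bar{\Delta},1}) \cap \bar{\Delta} \neq \emptyset$, which contradicts that the intersections of $\gamma_{\bar{\Delta},1}$ and other connected
components of $G^{\mathcal{D}}_{\bar{\Delta}}$ are empty. Therefore there exists a minimal $(\alpha, \beta)$-separator such that
$S_{\alpha,\beta} \cap \bar{\Delta} = \emptyset$.

Since each of $\gamma_{\bar{\Delta},1}$ and $\gamma_{\bar{\Delta},2}$ is a connected component, $S_{\alpha,\beta}$ satisfying $S_{\alpha,\beta} \cap \bar{\Delta} = \emptyset$ also
separates any pair of vertices in $\gamma_{\bar{\Delta},1}$ and $\gamma_{\bar{\Delta},2}$ other than $(\alpha, \beta)$. Hence $S_{\alpha,\beta}$ separates $\gamma_{\bar{\Delta},1}$
and $\gamma_{\bar{\Delta},2}$ in $G^{\mathcal{D}}$. This implies that $\gamma_{\bar{\Delta},1}$ and $\gamma_{\bar{\Delta},2}$ belong to different connected components
of $G^{\mathcal{D}}_{\Delta \setminus S_{\alpha,\beta}}$. Therefore (6) is satisfied.

On the other hand if there exist $S$, $\gamma_\alpha$ and $\gamma_\beta$ satisfying (6), it is obvious that $G^{\mathcal{D}}_{\bar{\Delta}}$ is
not connected. □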

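By the above lemma, we have the following corollary.

Corollary 3.1. Given two records $i, j$ and a generating class $\mathcal{D}$, we can find $E \subset \Delta$ such
that $E$-swapping is effective and fixes all $D$-marginals, $\forall D \in \mathcal{D}$, if and only if there exist
$S \in \mathcal{S}^{\mathcal{D}}$ and two connected components $\gamma_\alpha$ and $\gamma_\beta$ of $G^{\mathcal{D}}_{\Delta \setminus S}$ satisfying (6), that is,

$$i_S = j_S, \quad i_{\gamma_\alpha} \neq j_{\gamma_\alpha}, \quad i_{\gamma_\beta} \neq j_{\gamma_\beta}. \qquad (7)$$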

Theorem 3.1 and Corollary 3.1 are applicable to general hierarchical models. If $\mathcal{D}$
is a generating class of a graphical model associated with a graph $G$, then by definition
$G^{\mathcal{D}} = G$. Therefore we have the following corollary concerning a graphical model.

Corollary 3.2. Let $\mathcal{D}$ be a generating class of a graphical model associated with a graph
$G$. For two records $i, j$ define $\bar{\Delta}$ by (4). We can find $E \subset \Delta$ such that $E$-swapping of $i$
and $j$ is effective and fixes all $D$-marginals, $\forall D \in \mathcal{D}$, if and only if there exist $S \in \mathcal{S}^{\mathcal{D}}$
and two connected components $\gamma_\alpha$ and $\gamma_\beta$ of $G_{\Delta \setminus S}$ satisfying (6), that is,

$$i_S = j_S, \quad i_{\gamma_\alpha} \neq j_{\gamma_\alpha}, \quad i_{\gamma_\beta} \neq j_{\gamma_\beta}.$$
3.2 Searching for another record for swapping



So far we have considered some necessary and sufficient conditions for $E$-swapping between
two records $i, j$ to be effective and fix $D$-marginals for general hierarchical models. In
this section we consider finding another record which is swappable with a particular sample
unique record $i$ by using the results in the previous section.

Given a particular record $i$, by Corollary 3.1 we could scan through the microdata
set for another record $j$ satisfying the conditions of Corollary 3.1. Instead of checking
the conditions of Corollary 3.1 for each $j$, we could first construct the list $\mathcal{S}^{\mathcal{D}}$ of minimal
vertex separators $S$ and the connected components $\gamma_\alpha, \gamma_\beta$ of $G^{\mathcal{D}}_{\Delta \setminus S}$. Then for a particular
triple $(S, \gamma_\alpha, \gamma_\beta)$ we could check whether there exists another record $j$ satisfying (7) of
Corollary 3.1. Actually it is straightforward to check the existence of $j$ satisfying (7).
Since we require $i_S = j_S$, we only need to look at the slice of the contingency table given
the value of $i_S$. Then in this slice we look at the $\{i_{\gamma_\alpha}, i_{\gamma_\beta}\}$-marginal table. By the requirement
$i_{\gamma_\alpha} \neq j_{\gamma_\alpha}$, $i_{\gamma_\beta} \neq j_{\gamma_\beta}$, we omit the "row" $i_{\gamma_\alpha}$ and the "column" $i_{\gamma_\beta}$ from the marginal table.
If the resulting table contains a positive count, then we can find another record $j$ in a diagonal position
to $i$ and we can swap observations in $j$ and $i$. See Figure 1.



[Figure 1: j swappable with i in a diagonal position (axes: $\gamma_\alpha$ and $\gamma_\beta$)]



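More precisely, for $\gamma_\alpha$, $\gamma_\beta$, write $\gamma_{\alpha,\beta} = \gamma_\alpha \cup \gamma_\beta \cup S$. Define the subtable
$n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}})$ by

$$n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}}) = \{\, n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}}) \mid i'_{\gamma_\alpha} \neq i_{\gamma_\alpha},\ i'_{\gamma_\beta} \neq i_{\gamma_\beta},\ i'_S = i_S \,\}.$$

Let $n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}}) \neq 0$ denote that there exists at least one positive count in
$n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}})$. Then we have the following lemma. Proof is obvious and omitted.

Lemma 3.4. There exists a record $j$ with $i_S = j_S$, $i_{\gamma_\alpha} \neq j_{\gamma_\alpha}$, and $i_{\gamma_\beta} \neq j_{\gamma_\beta}$ if and only if
$n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}}) \neq 0$.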
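When the microdata set is held as a list of records rather than as a contingency table, the check in Lemma 3.4 amounts to a linear scan. The following is a minimal sketch under that assumption; the name `find_partner`, the 0-based variable positions, and the list-of-tuples data layout are illustrative choices, not the paper's implementation.

```python
def find_partner(records, i, S, gamma_a, gamma_b):
    """Return a record j satisfying (7) for the triple (S, gamma_a, gamma_b), or None."""
    def differs(j, variables):
        # i and j differ on a set of variables if at least one coordinate differs.
        return any(i[s] != j[s] for s in variables)

    for j in records:
        if (all(i[s] == j[s] for s in S)      # i_S = j_S: stay in the same slice
                and differs(j, gamma_a)       # leave the "row" of i
                and differs(j, gamma_b)):     # leave the "column" of i
            return j
    return None

records = [(0, 0, 0), (0, 1, 0), (0, 1, 1), (1, 1, 1)]
print(find_partner(records, (0, 0, 0), S={0}, gamma_a={1}, gamma_b={2}))
```

A tabulated version would instead slice the contingency table at $i_S$ and look for a positive count off the row and column of $i$, which avoids rescanning the records for every triple.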

The condition of Lemma 3.4 is easy to check. Therefore it remains to compute the set of minimal
vertex separators $\mathcal{S}^{\mathcal{D}}$ and the connected components of $G^{\mathcal{D}}_{\Delta \setminus S}$. Shiloach and Vishkin [1982]
proposed an algorithm for computing connected components of a graph. For listing
minimal vertex separators there exist algorithms by Berry et al. [2000] and Kloks and Kratsch
[1998]. The input of their algorithms is $G^{\mathcal{D}}$. However in our case the generating class $\mathcal{D}$ is
given in advance. It may be possible to obtain more efficient algorithms if we also use the
information of $\mathcal{D}$ as the input.

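The following algorithm searches for another record $j$ which is swappable with a sample
unique record $i$ and swaps them if such a record exists.

Algorithm 3.1 (Finding $j$ swappable for $i$ and swapping between $i$ and $j$).

Input : $\mathbf{n}$, $\mathcal{D}$, $\mathcal{S}^{\mathcal{D}}$, $i$
Output : a post-swapped table $\mathbf{n}' = \{n'(i)\}$
begin
    $\mathbf{n}' \leftarrow \mathbf{n}$ ;
    for every $S \in \mathcal{S}^{\mathcal{D}}$ do
    begin
        compute connected components of $G^{\mathcal{D}}_{\Delta \setminus S}$ ;
        for every pair of connected components $(\gamma_\alpha, \gamma_\beta)$ do
        begin
            if $n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}}) \neq 0$ then
            begin
                select a marginal cell $i'_{\gamma_{\alpha,\beta}}$ such that $n_{\gamma_{\alpha,\beta}}(i'_{\gamma_{\alpha,\beta}} \mid i_{\gamma_{\alpha,\beta}}) > 0$ ;
                select a cell $j \in \mathcal{I}$ such that $j_{\gamma_{\alpha,\beta}} = i'_{\gamma_{\alpha,\beta}}$ and $n(j) > 0$ ;
                $E \leftarrow \gamma_\alpha$ ;
                $E$-swapping between $i$ and $j$ ;
                $n'(i) \leftarrow n(i) - 1$ ;
                $n'(j) \leftarrow n(j) - 1$ ;
                $n'(j_E, i_{E^c}) \leftarrow n(j_E, i_{E^c}) + 1$ ;
                $n'(i_E, j_{E^c}) \leftarrow n(i_E, j_{E^c}) + 1$ ;
                exit ;
            end if
        end for
    end for
    if $\mathbf{n}' = \mathbf{n}$ then $i$ is not swappable ;
end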
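The pseudocode above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the paper: the table is assumed to be a `Counter` mapping cells (tuples with 0-based variable positions) to counts, the minimal vertex separators are assumed to be supplied by the caller, and the names `components_without` and `swap_out` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def components_without(variables, generating_class, S):
    """Connected components of the generated graph after deleting the vertices in S."""
    remaining, comps = set(variables) - set(S), []
    while remaining:
        comp, frontier = set(), [remaining.pop()]
        while frontier:
            s = frontier.pop()
            comp.add(s)
            for D in generating_class:
                if s in D:
                    # Unvisited vertices sharing some D with s are neighbours of s.
                    new = (set(D) & remaining) - comp
                    remaining -= new
                    frontier.extend(new)
        comps.append(comp)
    return comps

def swap_out(table, i, generating_class, separators):
    """Return a post-swap table for record i, or None if i is not swappable."""
    k = len(i)
    for S in separators:
        comps = components_without(range(k), generating_class, S)
        for ga, gb in combinations(comps, 2):
            for j, count in table.items():
                if (count > 0 and all(i[s] == j[s] for s in S)
                        and any(i[s] != j[s] for s in ga)
                        and any(i[s] != j[s] for s in gb)):
                    # E = ga: the two records exchange their values on ga.
                    i2 = tuple(j[s] if s in ga else i[s] for s in range(k))  # (j_E, i_{E^c})
                    j2 = tuple(i[s] if s in ga else j[s] for s in range(k))  # (i_E, j_{E^c})
                    new = Counter(table)
                    new[i] -= 1
                    new[j] -= 1
                    new[i2] += 1
                    new[j2] += 1
                    return +new  # unary plus drops cells whose count fell to zero
    return None

table = Counter({(0, 0, 0): 1, (0, 1, 1): 1})
print(swap_out(table, (0, 0, 0), [{0, 1}, {0, 2}], [{0}]))
```

The scan over `table.items()` plays the role of the two "select" steps; a production version would look up the slice and subtable of Lemma 3.4 directly instead of iterating over all occupied cells.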



In Takemura and Endo [2006] we applied this algorithm to a microdata set of $n = 9809$
records and $k = 8$ variables. There were 2243 sample unique records. We fitted a
decomposable model to the 8-way contingency table to identify 50 risky records among
the 2243 sample unique records. We then applied Algorithm 3.1 to check whether these 50
records are swappable or not. For most of these 50 records, Algorithm 3.1 quickly found
another record for swapping. Therefore we found that Algorithm 3.1 is very practical in
actual disclosure control procedures.
4 Some discussions



In this paper we considered swapping between two records. As mentioned above, if all
two-dimensional marginals are fixed, then we cannot swap between two records without
disturbing some marginal. However when we consider swapping among more than two
records, there are cases where we can fix all two-dimensional marginals, as illustrated by
the following example. Consider a table of 4 records with 3 variables. Each variable has
two levels (1 or 2).



x1   x2   x3

1    1    1
1    2    2
2    2    1
2    1    2



In this example every cell of each two-dimensional marginal has frequency exactly 1. If we
now circularly rotate the observations of x3, we obtain the following table.

x1   x2   x3

1    1    2
1    2    1
2    2    2
2    1    1



Then all 4 records are changed but all two-dimensional marginals are preserved. In fact
this example corresponds to a basic move of degree 4 (Diaconis and Sturmfels [1998]) of
the Markov basis for 2 x 2 x 2 contingency tables with fixed two-dimensional marginals.
More complicated examples can be given by translating the moves of 3 x 3 x K tables of
Aoki and Takemura [2003].
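The four-record example can be checked directly. The sketch below rebuilds the rotated table from the original one and compares all three two-dimensional marginals; the helper name `two_way_marginals` is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations

before = [(1, 1, 1), (1, 2, 2), (2, 2, 1), (2, 1, 2)]

# Circularly rotate the third column: each record takes the x3 value of the next one.
x3 = [r[2] for r in before]
rotated = x3[1:] + x3[:1]
after = [(r[0], r[1], v) for r, v in zip(before, rotated)]

def two_way_marginals(recs):
    """All two-dimensional marginal tables, keyed by the pair of variable positions."""
    return {
        pair: Counter((r[pair[0]], r[pair[1]]) for r in recs)
        for pair in combinations(range(3), 2)
    }

print(after)
print(two_way_marginals(before) == two_way_marginals(after))
print(set(before).isdisjoint(after))
```

The last line confirms that no record of the original table survives unchanged, even though every two-dimensional marginal is the same.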



Dobra [2003] proved that there exists a Markov basis consisting of primitive moves
for decomposable models. This implies the following fact in the case of decomposable
models. If a particular record can be changed by swaps possibly involving more than 2
records, then it is always possible to change the record by a swap involving the record
and another single record.

On the other hand Geiger et al. [2006] have shown that primitive moves do
not form a Markov basis for non-decomposable models. This implies that for
non-decomposable models, there is a possibility of swapping of a sample unique record
involving more than 2 records, even if it cannot be swapped with another single record,
which can be checked by Algorithm 3.1 of Section 3.2.

The theory of Markov bases is concerned with the swappability of all records with
arbitrary marginal counts. The investigation of this paper just asks whether a particular
sample unique record can be swapped with other records in a particular data set.
Therefore the problem considered here should be much easier than the problem of construction
of Markov bases for general hierarchical models of contingency tables. Still it is not clear
at this point how to construct a practical algorithm for checking swappability of a
particular record involving two other records, three other records, etc. This problem is left for
our future research.

A Equivalence of a primitive move and swapping of
two records

An effective $E$-swapping (1) changes the cell frequencies of $i$, $j$, $i' = (i_E, j_{E^c})$, $j' = (j_E, i_{E^c})$ into

$$n(i) \to n(i) - 1, \quad n(j) \to n(j) - 1, \quad n(i') \to n(i') + 1, \quad n(j') \to n(j') + 1. \qquad (8)$$

Hence the difference between the post-swapped and the pre-swapped tables is a primitive
move. If $E$-swapping fixes all $\mathcal{D}$-marginals, the corresponding primitive move also fixes
them.
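The change (8) can be written out as a sparse integer array. The following sketch builds it for given cells and a swap set; the function name `move_from_swap`, the dictionary representation of the move, and the 0-based variable positions are assumptions for illustration.

```python
from collections import Counter

def move_from_swap(i, j, E):
    """Difference (post-swap minus pre-swap table) caused by E-swapping of cells i and j."""
    k = len(i)
    i2 = tuple(i[s] if s in E else j[s] for s in range(k))  # i' = (i_E, j_{E^c})
    j2 = tuple(j[s] if s in E else i[s] for s in range(k))  # j' = (j_E, i_{E^c})
    move = Counter()
    for cell, delta in ((i, -1), (j, -1), (i2, 1), (j2, 1)):
        move[cell] += delta
    # Keep only the non-zero entries; an ineffective swap yields the empty move.
    return {cell: d for cell, d in move.items() if d != 0}

print(move_from_swap((1, 1, 1), (2, 2, 2), {0}))
print(move_from_swap((1, 1, 1), (1, 2, 2), {0}))
```

For an effective swap the result has exactly two entries equal to $1$ and two equal to $-1$; when the swap is not effective the contributions cancel and the dictionary is empty.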

Next we show that any primitive move (8) for $\mathcal{D}$ can be expressed by
$E$-swapping for some $E \subset \Delta$. Write

$$i = (i_1, \dots, i_k), \quad j = (j_1, \dots, j_k), \quad i' = (i'_1, \dots, i'_k), \quad j' = (j'_1, \dots, j'_k).$$

We first show that $\{i_m, j_m\} = \{i'_m, j'_m\}$ for $1 \leq m \leq k$. Since $\bigcup_t D_t = \Delta$ there exists
$t$ for any $m$ such that $m$ belongs to $D_t$. In the case where $i_{D_t} = j_{D_t}$, two records of
$n_{D_t}(i_{D_t})$ have to be preserved in $i'_{D_t}$ and $j'_{D_t}$. Hence $i'_{D_t} = j'_{D_t} = i_{D_t} = j_{D_t}$. On the other
hand if $i_{D_t} \neq j_{D_t}$, one record of each of $n_{D_t}(i_{D_t})$ and $n_{D_t}(j_{D_t})$ has to be preserved in
$\{i'_{D_t}, j'_{D_t}\}$, which implies $\{i_{D_t}, j_{D_t}\} = \{i'_{D_t}, j'_{D_t}\}$. Therefore we have $\{i_m, j_m\} = \{i'_m, j'_m\}$
for $1 \leq m \leq k$.
If we set

$$E = \{m \mid i_m = i'_m\} = \{m \mid j_m = j'_m\},$$

$E$ satisfies (1). This completes the proof of the equivalence of $E$-swapping and primitive
move for $\mathcal{D}$.